Application-Level Resilience Modeling for HPC Fault Tolerance

Authors

Luanzheng Guo

Hanlin He

Dong Li

Abstract

Understanding application resilience in the presence of faults is critical to addressing the HPC resilience challenge. Currently, we rely largely on random fault injection (RFI) to quantify application resilience. However, RFI provides little information on how fault tolerance happens, and RFI results are often not deterministic because of its random nature. In this paper, we introduce a new methodology to quantify application resilience. Our methodology is based on the observation that, at the application level, resilience to faults is due to application-level fault masking. Application-level fault masking happens because of application-inherent semantics and program constructs. Based on this observation, we analyze application execution information and use a data-oriented approach to model application resilience. We use our model to study how and why HPC applications can (or cannot) tolerate faults. We demonstrate tangible benefits of using the model to direct fault tolerance mechanisms.
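As a rough illustration of what application-level fault masking can look like, the toy Python sketch below flips a single bit in one input value and checks whether the program output changes. It is not the authors' model or tooling; `flip_bit`, `kernel`, the data values, and the threshold of 100.0 are all hypothetical choices made for this example. It shows two masking sources of the kind the abstract mentions: a program construct (a value overwritten before it is read) and application semantics (a comparison that ignores small perturbations).

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of an IEEE-754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def kernel(data, corrupt_index=None, bit=0):
    """Hypothetical toy kernel; optionally corrupts one element before running."""
    data = list(data)
    if corrupt_index is not None:
        data[corrupt_index] = flip_bit(data[corrupt_index], bit)
    # Program construct: element 0 is overwritten before it is ever read,
    # so any fault in it is masked.
    data[0] = 0.0
    # Application semantics: only the outcome of the comparison matters,
    # so small perturbations of the remaining elements are masked too.
    return max(data) > 100.0


inputs = [5.0, 1.0, 2.0, 3.0]
clean = kernel(inputs)

masked_by_overwrite = kernel(inputs, corrupt_index=0, bit=62) == clean
masked_by_comparison = kernel(inputs, corrupt_index=1, bit=0) == clean
not_masked = kernel(inputs, corrupt_index=1, bit=62) != clean

print(masked_by_overwrite, masked_by_comparison, not_masked)
```

Unlike random fault injection, each case here is deterministic and tied to an identifiable reason for the outcome, which is the kind of explanation a data-oriented analysis aims to provide.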

Similar resources

Using Performance Tools to Support Experiments in HPC Resilience

The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience...

Transparent Fault Tolerance for Job Healing in HPC Environments.

WANG, CHAO. Transparent Fault Tolerance for Job Healing in HPC Environments. (Under the direction of Associate Professor Frank Mueller). As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank ...

A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment

Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multipetaflops and beyond to exascale platforms, the occurrence of failure will be much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to i...

Resilience for Collaborative Applications on Clouds - Fault-Tolerance for Distributed HPC Applications

Because e-Science applications are data intensive and require long execution runs, it is important that they feature fault-tolerance mechanisms. Cloud and grid computing infrastructures often support system and network fault-tolerance. They repair and prevent communication and software errors. They also allow checkpointing of applications, duplication of jobs and data to prevent catastrophic ha...

5th Workshop on System-Level Virtualization for High Performance Computing (HPCVirt 2011)

The emergence of virtualization-enabled hardware, such as the latest generation AMD and Intel processors, has raised significant interest in the High Performance Computing (HPC) community. In particular, system-level virtualization provides an opportunity to advance the design and development of operating systems, programming environments, administration practices, and resource management tools. Th...

Journal title:

CoRR

Volume abs/1705.00267

Publication date: 2017
